Background: Evaluating the pathogenicity of a genetic variant is challenging given the many types of evidence that laboratories must consider. Weighing each type of evidence is difficult and time-consuming, as the relevant information is spread across several variant databases, in-silico prediction tools, and population frequency resources. Although frameworks such as the ACMG/AMP and CAP guidelines exist, variant classification remains a largely manual, labor-intensive process in clinical practice, particularly for variants observed at low frequency.

Aims: We evaluated the performance of a machine learning model that integrates diverse evidence for a given variant to compute a consensus classification in a 3-tier system (pathogenic, variant of unknown significance (VUS), benign) for low-frequency variants in myeloid and lymphoid malignancies.

Method: First, we trained a random forest classifier on 12,500 low-frequency variants manually classified and curated by highly experienced molecular biologists and hematologists, drawn from 50,000 samples sequenced in our laboratory during routine operations between 2011 and 2025. This hand-curated dataset comprised 5,000 pathogenic variants, 2,500 VUS, and 5,000 benign variants. In addition to our in-house classification labels, the model has real-time access to the COSMIC mutation database (v91), ClinVar (v2025-04), the gnomAD population frequency database (v2.1.1), and dbNSFP (v3.5), a collection of in-silico prediction tool results. Selected fields from these four external databases were used as features to train the model alongside our own classifications.
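The training setup described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the feature matrix, column choices (e.g. which gnomAD/dbNSFP fields are encoded), and label encoding are assumptions.

```python
# Minimal sketch of the described training setup; features and labels are
# synthetic stand-ins for the database-derived fields used in the study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 12_500  # curated low-frequency variants

# Stand-in feature matrix: e.g. gnomAD allele frequency, dbNSFP scores,
# COSMIC recurrence count, ClinVar consensus code (illustrative only).
X = rng.random((n, 8))
# 3-tier labels: 0 = benign, 1 = VUS, 2 = pathogenic
y = rng.integers(0, 3, size=n)

# Hold out 10% of variants for evaluation, as in the study design
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
```

A stratified split keeps the 2:1:2 class ratio comparable between training and test sets.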

Second, to evaluate applicability in a real-world clinical setting, we randomly selected 500 variants from our database that we had observed fewer than 50 times each (range 2-50) among 374,821 samples sequenced during routine diagnostics over the last 14 years.

Results: First, we evaluated the model's performance on a held-out test set (10%) of the initial 12,500 variants. Considering only high-confidence predictions (i.e. >70% likelihood), the model achieved an accuracy of 86%. Results are transparent: we employ SHAP (SHapley Additive exPlanations) to assess which features contributed to each final label.
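The study uses SHAP for per-variant attributions; as a self-contained stand-in (SHAP is an extra dependency), the sketch below shows the same idea of ranking feature contributions, here via scikit-learn's permutation importance. The toy data is constructed so that one feature drives the label.

```python
# Hedged sketch: ranking feature contributions for a fitted forest.
# The study uses SHAP for per-prediction attributions; permutation
# importance shown here is a simpler, global substitute.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.random((1_000, 4))
# Make feature 0 fully determine the label so it should rank first
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
top_feature = int(np.argmax(result.importances_mean))
```

In the clinical setting, such attributions let the reviewer see, for example, that a benign call was driven mainly by population frequency rather than by an in-silico score.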

In a second step, we inferred classifications for the 500 randomly selected variants. These represent a set that is difficult to assess in a routine environment: because they are rare, they exhibit little additional or conflicting information, and assessment can take many minutes for an experienced scientist.

Despite querying multiple sources, the algorithm completes within a few seconds. We again considered only predictions with a confidence of 70% or higher, which were achieved for 369 variants; 89% (n=329) of these were concordant with the previous manually curated result. Five pathogenic missense mutations were incorrectly classified as VUS, and 9 benign mutations were incorrectly classified as pathogenic (n=7) or VUS (n=2).
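The confidence filter described above can be expressed as a simple threshold on the model's class probabilities. The 0.70 cutoff matches the text; the probability rows and the "defer" label for below-threshold variants are illustrative assumptions.

```python
# Sketch of the >=70% confidence filter: only variants whose top class
# probability reaches 0.70 receive an automated call; the rest would be
# deferred to manual review. Probability rows here are illustrative.
import numpy as np

proba = np.array([
    [0.90, 0.05, 0.05],   # confident benign
    [0.40, 0.35, 0.25],   # below threshold -> defer to manual review
    [0.10, 0.15, 0.75],   # confident pathogenic
])
labels = np.array(["benign", "VUS", "pathogenic"])

confident = proba.max(axis=1) >= 0.70
calls = np.where(confident, labels[proba.argmax(axis=1)], "defer")
```

With `predict_proba` output from the trained forest in place of the hand-written rows, the same two lines yield the set of automated calls.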

Notably, in 26 instances (24 missense variants and 2 in-frame deletions) the manual curation was VUS, but the algorithm confidently classified the variant as either benign (n=13) or pathogenic (n=13). We reviewed these discrepancies specifically and would retrospectively concur with the model's assessment in each instance, based on multiple samples with homo-/heterozygous mutation load or on frequency information in gnomAD. For example, RUNX1:c.1172T>G was manually classified as VUS, but in all 12 samples in which we observed this variant the variant allele frequency (VAF) was approximately 50%, indicating a benign germline variant. Conversely, ZRSR2:c.934T>A was predicted pathogenic: in 3 samples we observed VAFs of 3%, 80%, and 98%, with no presence in gnomAD, which endorses the model output.
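The VAF reasoning applied in this retrospective review can be sketched as a small heuristic. This is not a rule from the study: the 50% target, the tolerance, and the two pattern labels are assumptions introduced purely to illustrate the argument.

```python
# Hedged sketch of the VAF-based plausibility check: a variant repeatedly
# observed near 50% VAF suggests a germline (likely benign) polymorphism,
# while widely varying VAFs across samples suggest a somatic event.
# Threshold values are illustrative assumptions, not the study's criteria.
def vaf_pattern(vafs, tol=0.10):
    """Classify an observed VAF pattern across samples."""
    if all(abs(v - 0.50) <= tol for v in vafs):
        return "germline-like"  # e.g. RUNX1:c.1172T>G at ~50% in 12 samples
    return "somatic-like"       # e.g. ZRSR2:c.934T>A at 3%/80%/98%
```

Combined with absence from gnomAD, a "somatic-like" pattern supports a pathogenic call, as in the ZRSR2 example above.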

Overall, the automated consensus classification drawing on all relevant sources reduces the time an interpreter needs to evaluate a rare variant from many minutes to seconds.

Conclusion: With the growing volume and sometimes conflicting nature of variant data, systems are needed that summarize the evidence so that a human can interrogate it quickly. Here we present an AI-based algorithm that robustly synthesizes evidence from multiple curated sources, including our extensive in-house data and in-silico predictions, to provide a consensus classification even where evidence is limited or conflicting. Our approach promises to streamline variant interpretation workflows, enhance reproducibility, and support the increasing throughput of modern clinical genomics.
